
    A Web-Based Medical Text Simplification Tool

    With the increasing demand for improved health literacy, better tools are needed to efficiently produce personalized health information that patients can read and understand. In this paper, we introduce a web-based text simplification tool that helps content producers simplify existing text materials to make them more broadly accessible. The tool offers concrete suggestions based on features that have each been shown in previous research to improve the understandability of text. We provide an overview of the tool along with a quantitative analysis of its impact on medical texts. On a medical corpus, the tool provides good coverage, with suggestions on over a third of the words and over a third of the sentences. These suggestions are over 40% accurate, though the accuracy varies by text source.
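    As a rough illustration of the kind of word-level suggestions such a tool can surface, the sketch below flags difficult medical terms and proposes plainer alternatives. The lexicon and function names here are hypothetical stand-ins, not the paper's actual resources.

```python
# A minimal sketch of word-level simplification suggestions, loosely inspired
# by the tool described above. The lexicon below is a hypothetical stand-in.
import re

# Hypothetical mapping from difficult medical terms to plainer synonyms.
SIMPLER = {
    "hypertension": "high blood pressure",
    "myocardial infarction": "heart attack",
    "analgesic": "pain reliever",
}

def suggest_simplifications(text: str) -> list[tuple[str, str]]:
    """Return (original, simpler) suggestion pairs found in the text."""
    suggestions = []
    lowered = text.lower()
    for term, simple in SIMPLER.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            suggestions.append((term, simple))
    return suggestions

print(suggest_simplifications(
    "Patients with hypertension should avoid this analgesic."
))
# [('hypertension', 'high blood pressure'), ('analgesic', 'pain reliever')]
```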

    Modeling Word Burstiness Using the Dirichlet Distribution

    Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial (DCM) model as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, as measured by perplexity. We also show, using three standard document collections, that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
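    The DCM (also known as the Dirichlet-multinomial or Pólya distribution) has a closed-form likelihood, which the sketch below evaluates for a document's word-count vector. The parameter values are illustrative only, not taken from the paper.

```python
# Log-likelihood of a word-count vector under the Dirichlet compound
# multinomial (DCM), using the standard closed form of the Polya distribution.
import numpy as np
from scipy.special import gammaln

def dcm_log_likelihood(counts: np.ndarray, alpha: np.ndarray) -> float:
    """log p(counts | alpha) under the DCM / Polya distribution."""
    n = counts.sum()
    a = alpha.sum()
    # multinomial coefficient: log n! - sum_w log x_w!
    coef = gammaln(n + 1) - gammaln(counts + 1).sum()
    # Dirichlet-multinomial integral in closed form
    return coef + gammaln(a) - gammaln(n + a) \
         + (gammaln(counts + alpha) - gammaln(alpha)).sum()

counts = np.array([3, 0, 1])           # word counts for a 3-word vocabulary
alpha = np.array([0.1, 0.1, 0.1])      # small alphas allow bursty behavior
print(dcm_log_likelihood(counts, alpha))
```

    Small alpha values concentrate probability mass on repeated occurrences of the same word, which is exactly the burstiness the multinomial cannot express.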

    Data-driven sentence simplification: Survey and benchmark

    Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. To do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far-from-solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common datasets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.
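    As a toy illustration of the splitting transformation mentioned above, a single hand-written rule might look like the following. Real data-driven systems learn such rewrites from aligned corpora; this rule is purely hypothetical.

```python
# A toy example of one SS transformation: sentence splitting at a
# coordinating conjunction. Hand-written and illustrative only.
def split_on_conjunction(sentence: str) -> list[str]:
    """Split a sentence at ', and ' into two shorter sentences."""
    if ", and " in sentence:
        left, right = sentence.split(", and ", 1)
        return [left.rstrip(".") + ".", right[0].upper() + right[1:]]
    return [sentence]

print(split_on_conjunction(
    "The trial enrolled 200 patients, and half received the new drug."
))
# ['The trial enrolled 200 patients.', 'Half received the new drug.']
```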


    Feature-based segmentation of narrative documents

    In this paper we examine topic segmentation of narrative documents, which are characterized by long passages of text with few headings. We first present results suggesting that previous topic segmentation approaches are not appropriate for narrative text. We then present a feature-based method that combines features from diverse sources as well as learned features. Applied to narrative books and encyclopedia articles, our method shows results that are significantly better than previous segmentation approaches. An analysis of individual features is also provided, and the benefit of generalization using outside resources is shown.
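    As a minimal sketch of the feature-based view of segmentation, the code below scores candidate sentence boundaries with a single lexical-cohesion feature and thresholds it. The paper's method combines many richer and learned features, so this is illustrative only; the window and threshold values are arbitrary.

```python
# Bare-bones boundary detection: propose a topic boundary where lexical
# overlap between adjacent blocks of sentences drops below a threshold.
def overlap_score(left: list[str], right: list[str]) -> float:
    """Jaccard overlap between the word sets of two text blocks."""
    a, b = set(left), set(right)
    return len(a & b) / len(a | b) if a | b else 0.0

def boundaries(sentences: list[list[str]], window: int = 2,
               thresh: float = 0.1) -> list[int]:
    """Propose a boundary before sentence i when cohesion drops below thresh."""
    cuts = []
    for i in range(window, len(sentences) - window):
        left = [w for s in sentences[i - window:i] for w in s]
        right = [w for s in sentences[i:i + window] for w in s]
        if overlap_score(left, right) < thresh:
            cuts.append(i)
    return cuts

docs = [["the", "drug", "trial"], ["the", "trial", "results"],
        ["rome", "has", "ruins"], ["ruins", "draw", "tourists"]]
print(boundaries(docs, window=1))   # [2]: cohesion drops at the topic shift
```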

    Contributions to research on machine translation

    In the past few decades machine translation research has made major progress. A researcher now has access to many systems, both commercial and research, of varying levels of performance. In this thesis, we describe different methods that leverage these pre-existing systems as tools for research in machine translation and related fields.

    We first examine techniques for improving a translation system using additional text. The first method uses a monolingual corpus. Discrepancies are identified by translating a word list to a foreign language and back again. Entries where the original word and its double translation differ are used to learn word-level correction rules. The second method uses parallel bilingual data consisting of source language/target language sentence pairs. The source sentences are translated using a translation system, and a partial alignment is identified between the machine-translated sentences and the corresponding human-translated sentences in the target language. This alignment is used to generate phrase-level correction rules. Experimentally, both word-level and phrase-level correction rules result in improved translation performance. The learned word-level correction rules make 24,235 corrections on 20,000 Spanish-to-English translated sentences, with high accuracy. The learned phrase-level rules improve the translation performance (as measured by BLEU) of a French-to-English commercial system by 30%, and of a state-of-the-art phrase-based system in a statistically significant way.

    To train current statistical machine translation systems, bilingual examples of parallel sentences are used. Generating this data is costly, and currently feasible only in limited domains and languages. A fundamental question is whether every potential example is equally useful. We describe a ranking method that scores individual sentence pairs based on the performance of translation systems trained on random subsets of the examples. When used to train a translation system, the top-ranking examples result in a significantly better-performing system than a random selection of examples. Given these ranked examples, a model of example usefulness can potentially be learned to select the most useful unlabeled examples. Initial experiments show that two previously used example features are good candidates for identifying useful examples.

    In the last part of this thesis we describe how automatic paraphrasing methods can be used to improve the accuracy of evaluation measures for machine translation. Given a human-generated reference sentence and a machine-generated translated sentence, we present a method that finds a paraphrase of the reference sentence that is closer in wording to the machine output than the original reference is. We show that using paraphrased reference sentences for evaluating translation system output results in better correlation with human judgement of translation adequacy than using the original reference sentences.
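    The round-trip idea behind the word-level correction rules can be sketched as follows. Here `to_foreign` and `to_english` stand in for calls to a real translation system, and the toy dictionaries are purely hypothetical.

```python
# Schematic sketch of the round-trip technique: translate each word to the
# foreign language and back, and flag entries where the double translation
# differs from the original word.
def find_discrepancies(words, to_foreign, to_english):
    """Return (word, round_trip) pairs where the double translation differs."""
    pairs = []
    for w in words:
        round_trip = to_english(to_foreign(w))
        if round_trip != w:
            pairs.append((w, round_trip))  # candidate for a correction rule
    return pairs

# Toy stand-in translators for demonstration only.
fake_es = {"doctor": "médico", "nurse": "enfermera"}
fake_en = {"médico": "physician", "enfermera": "nurse"}
print(find_discrepancies(
    ["doctor", "nurse"],
    lambda w: fake_es.get(w, w),
    lambda w: fake_en.get(w, w),
))
# [('doctor', 'physician')]  -> evidence for a word-level correction rule
```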

    Effect of Boosting in BWI

    Recent work in information extraction has brought about a new method for text extraction using wrappers. A wrapper is a simple, but highly accurate, extraction procedure. Unfortunately, these wrappers tend to have low recall. To remedy this problem, boosted wrapper induction (BWI) was proposed. This method combines a weak wrapper learner with AdaBoost to generate a more general extraction rule. The result is an algorithm with a bias towards precision, but with reasonable recall in both structured and natural text domains. The exact benefit of boosting over more traditional approaches is not always apparent. In this paper, we examine the benefits of boosting by comparing BWI to two different sequential covering algorithms with wrappers for text extraction in the framework of both highly structured and natural text. Sequential covering is a simple, straightforward algorithm that tries to cover as many positive examples as possible with a single rule, removes the covered examples from the training set, and continues until all of the positive examples have been covered; a sketch of this loop appears below. We present results from a broad range of information extraction tasks and show that the basic benefit of boosting in this domain is to allow BWI to continue learning new and helpful rules, without overfitting the training data, even after all of the positive examples are covered. This result is consistent with previous theoretical and experimental results.
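    A compact sketch of the sequential-covering loop described above, assuming a hypothetical `learn_one_rule` placeholder for any single-rule learner such as a wrapper inducer:

```python
# Sequential covering: repeatedly learn one rule, keep it, and remove the
# positive examples it covers, until every positive example is covered.
def sequential_covering(positives, negatives, learn_one_rule):
    """Learn rules until every positive example is covered."""
    rules = []
    remaining = list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives)
        covered = [x for x in remaining if rule(x)]
        if not covered:          # learner made no progress; stop early
            break
        rules.append(rule)
        remaining = [x for x in remaining if not rule(x)]
    return rules

# Toy usage: rules are predicates; this "learner" keys on the first letter.
learn = lambda pos, neg: (lambda x, t=pos[0][0]: x.startswith(t))
print(len(sequential_covering(["apple", "ant", "bee"], [], learn)))  # 2
```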